Abstract

We study repeated games played by players with bounded computational
power, where, in contrast to Abreu and Rubinstein [1], the memory is costly.
We prove a folk theorem: the limit set of equilibrium payoffs in mixed strategies,
as the cost of memory goes to 0, includes the set of feasible and individually
rational payoffs. This result stands in sharp contrast to [1], where it is proved that
when memory is free, the set of equilibrium payoffs in repeated games played
by players with bounded computational power is a strict subset of the set of
feasible and individually rational payoffs. Our result emphasizes the role of
memory cost and of mixing when players have bounded computational power.

Keywords: Bounded rationality, automata, complexity, infinitely repeated games,
equilibrium.



1 Introduction 

In seminal work, Simon [21], [22] recognized the impact of bounded rationality on
the economic modeling of individual agents and of organizations. In the last few
decades an expanding literature has studied the implementation cost of strategies in
strategic interactions (see, e.g., Rubinstein [18], Chatterjee and Sabourian [5]). One
particular research question deals with achieving a target outcome, such as a collusion,
cooperation, or bargaining outcome, by non-sophisticated agents (see, e.g., Chatterjee
and Sabourian [3], Sabourian [20], Gale and Sabourian [6], and Maenner [9]).

One common way to model players with bounded rationality is by restricting them 
to strategies that can be implemented by finite state machines, or automata. The 
game theoretic literature on repeated games played by finite automata can be roughly 
divided into two categories. On the one hand, an extensive literature (e.g., Kalai [8], 
Ben Porath [3], Piccione [16], Piccione and Rubinstein [17], Neyman [10], [11], [12],
Neyman and Okada [13], [14], [15], Zemel [23]) studies games where the memory size
of the two players is determined exogenously, so that each player can deviate only to
strategies with the given memory size. On the other hand, Rubinstein [18], Abreu
and Rubinstein [1] and Banks and Sundaram [2] study games where the players have 
lexicographic preferences: each player tries to maximize her payoff, and subject to 
that she tries to minimize her memory size. Thus, it is assumed that memory is 
free, and a player would deviate to a significantly more complex strategy if that would
increase her profit by one cent. 

In practice, the level of complexity that players can use in their strategies is not 
known in advance, either because players do not know each other's computational
power, because players may increase their computational power if they realize that 
such an increase is beneficial, or because players may decrease their computational 
power if the loss caused by this decrease is compensated by the reduced expenses due 
to this decision. 

In the present paper we take a more pragmatic point of view than the two approaches
mentioned above, and we study repeated games played by boundedly rational
players, when the computational power is costly.

As a motivating example, consider employees' training for a new job. The training 
period enables the employee to cope with situations that he may encounter in the 
future. The longer the training period, the better prepared the employee will be,
thereby increasing the employer's profit. 

Once the employee starts working, he follows the instructions that he learned, 
and so we can model the employee as a finite state machine. The training period 
dictates the size of the machine, that is, the number of its states, and the training 
itself determines how the machine behaves in various situations. Because training is 
costly, the employer will try to balance between the length of the training period and 
the gains from extended training. 

When employees of different employers interact, say a salesperson and a buyer,
the evolution of the interaction is dictated by their training. A salesperson, say, may
interact with buyers of different firms, who undertook different training programs,
and therefore follow different finite state machines. Therefore he has some uncertainty
regarding the finite state machine that the buyer will follow, so that in fact he faces
a mixed strategy.

The employers, who plan the training of their respective employees, then face a 
game, where each tries to teach her employees the techniques that best cope with 
the techniques taught by the other employer. As salespersons and buyers interact 
repeatedly, the situation can be modelled as a repeated game played by finite state 
machines, where the goal of each player is to maximize some combination of the 
long-run average payoff and the cost of training. 

To capture situations like the one in the example, we assume for simplicity that
the players have additive utility: the utility of a player is her long-run
average payoff minus the cost of her computational power. Formally, for every positive
real number c, we say that the vector $x \in \mathbb{R}^2$ is a c-Bounded Computational Capacity
equilibrium payoff (hereafter, BCC for short) if it is an equilibrium payoff when the utility of each
player is the difference between her long-run average payoff and c times the size of her
finite state machine.

A payoff vector $x \in \mathbb{R}^2$ is a BCC equilibrium payoff if it is the limit, as c goes to
0, of payoffs that correspond to c-BCC equilibria, and the
cost of the machines used along the sequence converges to 0.

Interestingly, the definition does not imply that the set of BCC equilibrium payoffs
is either a subset or a superset of the set of Nash equilibrium payoffs.

Our main result is a folk theorem: in two-player games, every feasible and individually
rational (w.r.t. the min-max value in pure strategies) payoff vector is a BCC
equilibrium payoff.

Our proof is constructive: we explicitly construct equilibrium strategies. The
equilibrium play is composed of three phases. The first phase, that on the equilibrium 
path is played only once, is a punishment phase; in this phase each player plays a 
strategy that punishes the other player, that is, an action that attains the min-max 
value in pure strategies of the opponent. As in [1], it is crucial to have the punishment 
phase on the equilibrium path; otherwise, players can use smaller machines, that 
cannot implement punishment and lower the cost of their machines. However, if a 
machine cannot implement punishment, there is nothing that will deter the other 
player from deviating. The second phase, called the babbling phase, is also played 
only once on the equilibrium path. In this phase the players play a predetermined 
sequence of action pairs. In the third phase, called the regular phase, the players play 
repeatedly a predetermined periodic sequence of action pairs that approximates the 
desired target payoff. To implement this phase, the players re-use states that were 
used in the babbling phase. In fact, the role of the babbling phase is to enable one to 
embed the regular phase within it, and its structure is designed to simplify complexity 
calculations. It is long enough to ensure that with only low probability a player can
correctly guess which of the states in the other player's machine are re-used. 

One can describe the equilibrium path by imagining the following meeting between 
two strangers. At first, the strangers exchange threats and vivid descriptions of what 
each one will do to the other if the other does not behave as desired. After they prove 
to each other that they can execute punishment, they indulge in a long small-talk. 
Finally, they go to business, and implement the desired outcome. 

Our paper is closely related to Abreu and Rubinstein [1], where a characterization
of the set of equilibrium payoffs is provided when the players have a lexicographic 
preference: subject to maximizing her long-run average payoff each player wishes 
to minimize the complexity of the finite state machine that implements her strategy. 
The main result of [1] is that the set of equilibrium payoffs is the set of all feasible and 
individually rational payoffs (relative to the min-max value in pure strategies) that 
can be supported by coordinated play. The main message of [1] is that a folk theorem
does not hold: the set of equilibrium payoffs may be strictly smaller than the set
of feasible and individually rational payoffs. Our result shows that two properties
of the model of [1] drive their result. First, [1] assumes that computational power
is costless, so that players will deviate to a prohibitively large automaton to gain a
cent. This is in contrast to our model, where computational power is costly. Second,
[1] restricts the players to pure strategies, whereas we allow the players to use mixed
strategies. 

Abreu and Rubinstein [1] point at a difficulty in using mixed strategies in games 
played by players with bounded computational power: mixing is a complex operation, 
and players with bounded computational power will prefer a pure strategy to
a mixed strategy, thereby saving the cost of mixing. We argue that there are at least
two interpretations of the model where the use of mixed strategies is natural. First, 
it may happen that the agent playing the game is limited, whereas the player who 
chooses the strategy for the agent does not have limits on her computational power. 
Thus, the complexity of computing the strategy played by the agent can be large, 
and include mixing, while the complexity of implementing this strategy should be 
low. Second, a player may not know the identity of the agent whom her own agent is 
going to face, and therefore she does not know the pure simple strategy which that 
agent is going to use. Alternatively, the other agents whom her agent is going to face
may use different pure strategies. Thus, the player may assume that the other player
randomly chooses her simple strategy. In our construction the role of mixing is to
hide the strategy that each agent uses. Whereas in [1] the players use pure strategies
to reduce their computational power, which leads to a significantly smaller set of 
equilibrium payoffs, mixing allows the players to use once again complex strategies, 
and the folk theorem is restored. 

The rest of the paper is organized as follows. Section 2 presents the model and
the main result. The construction of a mixed equilibrium strategy for both players
in the particular case of the Prisoner's Dilemma is presented in Section 3. In Section
4 we explain how the construction is adapted for general two-player games.

2 The Model and the Main Result 

In this section we define the model, including the concepts of automata, repeated 
games, and strategies implementable by an automaton; we describe our solution con- 
cept of Bounded Computational Capacity equilibrium, and we state the main result. 

2.1 Repeated Games 

A two-player repeated game is given by (1) two finite action sets $A_1$ and $A_2$ for the
two players, and (2) two payoff functions $u_1 : A_1 \times A_2 \to \mathbb{R}$ and $u_2 : A_1 \times A_2 \to \mathbb{R}$
for the two players.

The game is played as follows. At every stage $t$, each player $i \in \{1,2\}$ chooses an
action $a_i^t \in A_i$, and receives the stage payoff $u_i(a_1^t, a_2^t)$. The goal of each player is to
maximize her long-run average payoff $\lim_{t \to \infty} \frac{1}{t} \sum_{j=1}^{t} u_i(a_1^j, a_2^j)$, where $\{(a_1^j, a_2^j), j \in \mathbb{N}\}$
is the sequence of action pairs that were chosen by the players.^1 A pure strategy
of player $i$ is a function that assigns an action in $A_i$ to every finite history
$h \in \bigcup_{t=0}^{\infty} (A_1 \times A_2)^t$. A mixed strategy of player $i$ is a probability distribution over pure
strategies.

2.2 Automata 

A common way to model a decision maker with bounded computational capacity is as 
an automaton, which is a finite state machine whose output depends on the current 
state, and whose evolution depends on the current state and on its input (see, e.g., 
Neyman [10] and Rubinstein [18]). Formally, an automaton $P$ is given by (1) a finite
state space $Q$, (2) a finite set $I$ of inputs, (3) a finite set $O$ of outputs, (4) an output
function $f : Q \to O$, (5) a transition function $g : Q \times I \to Q$, and (6) an initial state
$q^* \in Q$.

Denote by $q^t$ the automaton's state at stage $t$. The automaton starts in state
$q^1 = q^*$, and at every stage $t$, as a function of the current state $q^t$ and the current
input $i^t$, the output of the automaton $o^t = f(q^t)$ is determined, and the automaton
moves to a new state $q^{t+1} = g(q^t, i^t)$.

The size of an automaton $P$, denoted by $|P|$, is the number of states in $Q$. Below
we will use strategies that can be implemented by automata; in this case the size of
the automaton measures the complexity of the strategy.
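The definition above can be sketched in code. The following is a minimal illustration, not part of the original construction: the class name `Automaton`, its field names, and the two-state "grim trigger" example are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Sequence, Tuple

State = Hashable


@dataclass(frozen=True)
class Automaton:
    """A finite automaton with output function f, transition function g, and initial state q*."""
    output: Dict[State, str]                     # f : Q -> O
    transition: Dict[Tuple[State, str], State]   # g : Q x I -> Q
    initial: State                               # q* in Q

    @property
    def size(self) -> int:
        """|P|: the number of states in Q."""
        return len(self.output)

    def run(self, inputs: Sequence[str]) -> List[str]:
        """Return the outputs o^1, o^2, ... produced while reading the inputs."""
        state, outputs = self.initial, []
        for symbol in inputs:
            outputs.append(self.output[state])         # o^t = f(q^t)
            state = self.transition[(state, symbol)]   # q^{t+1} = g(q^t, i^t)
        return outputs


# Hypothetical example: cooperate until the input D is observed, then defect forever.
grim = Automaton(
    output={"coop": "C", "punish": "D"},
    transition={
        ("coop", "C"): "coop", ("coop", "D"): "punish",
        ("punish", "C"): "punish", ("punish", "D"): "punish",
    },
    initial="coop",
)

print(grim.size)
print(grim.run(["C", "C", "D", "C"]))
```

Note that the output at stage $t$ depends only on the current state, so the reaction to the input D appears one stage later.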

^1 In general this limit need not exist. Our solution concept will take care of this issue.

2.3 Strategies Implemented by Automata



Fix a player $i \in \{1,2\}$. An automaton $P$ whose set of inputs is the set of actions
of player $3-i$ and whose set of outputs is the set of actions of player $i$, that is, $I = A_{3-i}$
and $O = A_i$, can implement a pure strategy of player $i$. Indeed, at every stage $t$, the
strategy plays the action $f(q^t)$, and the new state of the automaton $q^{t+1} = g(q^t, a_{3-i}^t)$
depends on its current state $q^t$ and on the action $a_{3-i}^t$ that the other player played at
stage $t$. For $i = 1, 2$, we denote an automaton that implements a strategy of player $i$
by $P_i$. We denote by $\mathcal{P}_i^m$ the set of all automata with $m$ states that implement pure
strategies of player $i$.

When the players use arbitrary strategies, the long-run average payoff need not
exist. However, when both players use strategies that can be implemented by automata,
say $P_1$ and $P_2$ of sizes $p_1$ and $p_2$ respectively, the evolution of the automata
follows a Markov chain with $p_1 \times p_2$ states, and therefore the long-run average payoff
exists. We denote this average payoff by $\gamma(P_1, P_2) \in \mathbb{R}^2$.
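For pure automata the joint state evolves deterministically over a finite set, so the play is eventually periodic and the long-run average payoff is the average over the cycle. The sketch below illustrates this under stated assumptions: each automaton is represented as a hypothetical triple `(output, transition, initial)`, and the payoff table is that of the Prisoner's Dilemma of Section 3.

```python
from fractions import Fraction


def long_run_average(p1, p2, payoff):
    """Long-run average payoff of two pure automata.

    Each automaton is a triple (output, transition, initial); payoff maps an
    action pair (a1, a2) to the pair of stage payoffs (u1, u2).
    """
    out1, tr1, q1 = p1
    out2, tr2, q2 = p2
    first_visit = {}   # joint state -> index of the stage at which it first occurred
    history = []       # stage payoffs in order of play
    while (q1, q2) not in first_visit:
        first_visit[(q1, q2)] = len(history)
        a1, a2 = out1[q1], out2[q2]
        history.append(payoff[(a1, a2)])
        q1, q2 = tr1[(q1, a2)], tr2[(q2, a1)]   # each machine reads the other's action
    cycle = history[first_visit[(q1, q2)]:]     # the part of the play that repeats forever
    n = len(cycle)
    return (Fraction(sum(u for u, _ in cycle), n),
            Fraction(sum(v for _, v in cycle), n))


PD = {("D", "D"): (1, 1), ("D", "C"): (4, 0), ("C", "D"): (0, 4), ("C", "C"): (3, 3)}

# Hypothetical machines used only for illustration.
grim = ({"coop": "C", "punish": "D"},
        {("coop", "C"): "coop", ("coop", "D"): "punish",
         ("punish", "C"): "punish", ("punish", "D"): "punish"},
        "coop")
always_defect = ({"d": "D"}, {("d", "C"): "d", ("d", "D"): "d"}, "d")

print(long_run_average(grim, grim, PD))
print(long_run_average(grim, always_defect, PD))
```

The transient stage in which the grim machine is exploited once does not affect the long-run average, which depends only on the cycle.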

A mixed automaton $M$ is a probability distribution over pure automata.^2 A mixed
automaton corresponds to the situation in which the automaton that is used is not 
known, and there is a belief over which automaton is used. A mixed automaton defines 
a mixed strategy: at the outset of the game, a pure automaton is chosen according 
to the probability distribution given by the mixed automaton, and the strategy that 
the pure automaton defines is executed. 

We will use only mixed automata whose support is pure automata of a given size
$m$. Denote by $\mathcal{M}_i^m$ the set of all mixed automata whose support is automata in
$\mathcal{P}_i^m$, and by $\mathcal{M}_i = \bigcup_{m \in \mathbb{N}} \mathcal{M}_i^m$ the set of all mixed automata whose support contains
automata of the same size. If $M_i \in \mathcal{M}_i^m$ we say that $m$ is the size of the automaton
$M_i$. Thus, the size of a mixed automaton refers to the size of the pure automata in
its support (and not, for example, to the number of pure automata in its support). If
we interpret each pure automaton as an agent's type, and a mixed automaton as the
type distribution in the population, then the size of the mixed automaton measures
the complexity of an individual agent, and not the type diversity in the population.

When both players use mixed strategies that can be implemented by mixed automata,
the expected long-run average payoff exists; it is the expectation of the
long-run average payoff of the (pure) automata that the players play:

$\gamma(M_1, M_2) := \mathbf{E}_{M_1, M_2}[\gamma(P_1, P_2)]$.

^2 To emphasize the distinction between automata and mixed automata, we call the former pure
automata.

2.4 Bounded Computational Capacity Equilibrium

In the present section we study games where the utility function of each player takes
into account the complexity of the strategy that she uses. 

Definition 1 Let $c > 0$. A pair of mixed automata $(M_1, M_2)$ is a c-BCC equilibrium
if it is a Nash equilibrium for the utility functions $U_i^c(M_1, M_2) = \gamma_i(M_1, M_2) - c|M_i|$,
$i \in \{1,2\}$.
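As a small numerical illustration of the utility in Definition 1, the sketch below computes the expected long-run average payoff of two mixed automata and subtracts the memory cost. The automaton labels "A" and "B", their pairwise payoffs in `gamma`, the common size 2, and the cost `c` are all hypothetical values chosen for illustration; the function names are likewise assumptions.

```python
from fractions import Fraction


def expected_payoff(mix1, mix2, gamma):
    """Expectation of gamma(P1, P2) when P1 ~ mix1 and P2 ~ mix2 independently.

    mix1, mix2 map automaton labels to probabilities; gamma maps a pair of
    labels to the pair of long-run average payoffs.
    """
    total1, total2 = Fraction(0), Fraction(0)
    for name1, prob1 in mix1.items():
        for name2, prob2 in mix2.items():
            g1, g2 = gamma[(name1, name2)]
            total1 += prob1 * prob2 * g1
            total2 += prob1 * prob2 * g2
    return total1, total2


def bcc_utility(mix1, mix2, gamma, size1, size2, c):
    """U_i^c = gamma_i - c * |M_i| for i = 1, 2."""
    g1, g2 = expected_payoff(mix1, mix2, gamma)
    return g1 - c * size1, g2 - c * size2


# Hypothetical long-run average payoffs of pure automata labelled "A" and "B".
gamma = {("A", "A"): (3, 3), ("A", "B"): (0, 4),
         ("B", "A"): (4, 0), ("B", "B"): (1, 1)}
mix1 = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
mix2 = {"A": Fraction(1)}
c = Fraction(1, 10)

print(bcc_utility(mix1, mix2, gamma, size1=2, size2=2, c=c))
```

Because all pure automata in the support of a mixed automaton have the same size, the cost term is deterministic and can be subtracted after taking the expectation.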

If the game has an equilibrium in pure strategies, then the pair of pure automata
$(P_1, P_2)$, both with size 1, that repeatedly play the equilibrium actions of the two
players, is a c-BCC equilibrium, for every $c > 0$.

The min-max value of player $i$ in pure strategies in the one-shot game is

$v_i := \min_{a_{3-i} \in A_{3-i}} \max_{a_i \in A_i} u_i(a_i, a_{3-i})$.

An action $a_{3-i}$ that attains the minimum is termed a punishing action of player $3-i$.
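The min-max value in pure strategies can be computed by direct enumeration. The sketch below is illustrative only: the function name `pure_minmax` and the 0-based player index are assumptions, and the payoff table is that of the Prisoner's Dilemma studied in Section 3.

```python
def pure_minmax(payoff, own_actions, other_actions, player):
    """Return (v_i, punishing action of the opponent) for player index 0 or 1.

    payoff maps an action pair (a1, a2) to the pair of stage payoffs (u1, u2).
    """
    def u(own, other):
        pair = (own, other) if player == 0 else (other, own)
        return payoff[pair][player]

    # For each action of the opponent, the best payoff the player can secure against it.
    best_reply_value = {b: max(u(a, b) for a in own_actions) for b in other_actions}
    punishing = min(best_reply_value, key=best_reply_value.get)
    return best_reply_value[punishing], punishing


PD = {("D", "D"): (1, 1), ("D", "C"): (4, 0), ("C", "D"): (0, 4), ("C", "C"): (3, 3)}
actions = ["D", "C"]

print(pure_minmax(PD, actions, actions, player=0))
print(pure_minmax(PD, actions, actions, player=1))
```

In the Prisoner's Dilemma both calls return the value 1 with punishing action D, in line with the statement in Section 3.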

To get rid of the dependency on the constant $c$ we define the concept of a BCC
equilibrium payoff. A payoff vector $x$ is a BCC equilibrium payoff if it is the limit, as
$c$ goes to 0, of the payoffs that correspond to c-BCC equilibria.

Definition 2 A payoff vector $x = (x_1, x_2)$ is a BCC equilibrium payoff if for every
$c > 0$ there is a c-BCC equilibrium $(M_1(c), M_2(c))$ such that $\lim_{c \to 0} U^c(M_1(c), M_2(c)) = x$
and $\lim_{c \to 0} c|M_i(c)| = 0$ for $i = 1, 2$.

The condition $\lim_{c \to 0} U^c(M_1(c), M_2(c)) = x$ in the definition of a BCC equilibrium
payoff ensures that $x$ can be supported as an equilibrium, while the condition
$\lim_{c \to 0} c|M_i(c)| = 0$ ensures that the cost of the automata that support this equilibrium is
negligible. In particular, the long-run average payoffs also converge to $x$:
$\lim_{c \to 0} \gamma(M_1(c), M_2(c)) = x$.

It follows from the discussion above that every pure equilibrium payoff is a BCC
equilibrium payoff. Using Abreu and Rubinstein's [1] proof, one can show that any
individually rational payoff (relative to the min-max value in pure strategies) that
can be generated by coordinated play is a BCC equilibrium payoff. For the formal
statement, assume w.l.o.g. that $|A_1| \le |A_2|$.

Theorem 3 (Abreu and Rubinstein, 1988) Let $\sigma : A_1 \to A_2$ be a one-to-one
function. Then any payoff vector $x$ in the convex hull of $\{u(a_1, \sigma(a_1)), a_1 \in A_1\}$ that
satisfies $x_i > v_i$ for $i = 1, 2$ is a BCC equilibrium payoff.

2.5 The Main Result



The set of feasible payoff vectors is

$F := \mathrm{conv}\{u(a), a \in A_1 \times A_2\}$.

The set of strictly individually rational payoff vectors (relative to the min-max value
in pure strategies) is

$V := \{x = (x_1, x_2) \in \mathbb{R}^2 : x_1 > v_1, x_2 > v_2\}$.

Our main result is the following folk theorem, that states that every feasible and 
strictly individually rational payoff vector is a BCC equilibrium payoff. 

Theorem 4 Every vector in $F \cap V$ is a BCC equilibrium payoff.

Observe that Theorem 4 is not a characterization of the set of BCC equilibrium
payoffs, because it does not rule out the possibility that a feasible payoff that is 
not individually rational (relative to the min-max value in pure strategies) is a BCC 
equilibrium payoff. That is, we do not know whether threats of punishments by a 
mixed strategy in the one-shot game can be implemented in a BCC equilibrium. 

Theorem 4 stands in sharp contrast to the main message of Abreu and Rubinstein
[1], where it is proved that lexicographic preferences, which are equivalent to an
infinitesimal cost function c, imply that in equilibrium players follow coordinated
play, so that the set of equilibrium payoffs is sometimes smaller than the set of feasible
and individually rational payoffs. Our study shows that the result of Abreu and
Rubinstein [1] hinges on two assumptions: (a) memory is costless, and (b) the players
use only pure automata. Once we assume that memory is costly, and that players
may use mixed automata, the set of equilibrium payoffs changes dramatically.

2.6 Comments and Discussion 
2.6.1 On the definition of BCC equilibria 

The definition of BCC equilibrium is analogous to the definition of Nash equilibrium; in
both we ask whether a specific behavior (that is, a pair of strategies) is stable. Thus, in 
a c-BCC equilibrium we assume that each player already has an automaton with which 
she is going to play the game, and we ask whether playing this automaton is the best 
response given the automaton that the other player is going to use. As in the definition 
of Nash equilibrium, we do not ask how the players arrived at these automata, and 
we do not restrict the sizes of these automata (though the memory cost does bound 
the maximal size of automaton that the players will use). In principle it may well
be that some BCC equilibrium payoff can be supported only with prohibitively large
automata, which we would like to rule out. That is, we may want to add the size 
of the automata that the players use to the definition itself. In our construction 
(see the proof of Theorem 4), to support a c-BCC equilibrium payoff that is close to
some target payoff x we use two automata of similar sizes; the size of the automaton 
is related to both c and to the level of approximation to the target payoff: as c 
gets closer to 0, and as the c-BCC equilibrium payoff gets closer to x, we use larger 
automata. 

2.6.2 BCC equilibria and Nash equilibria 

Theorem 4 states that every feasible and individually rational (w.r.t. the min-max
value in pure strategies) payoff vector is a BCC equilibrium payoff. This theorem does
not rule out the possibility that some payoff vector that is not individually
rational is still a BCC equilibrium payoff; that is, a BCC equilibrium payoff need
not be a Nash equilibrium payoff. The theorem also does not rule out the possibility
that some payoff vector that is individually rational w.r.t. the min-max value in mixed
strategies, but not individually rational w.r.t. the min-max value in pure strategies,
is not a BCC equilibrium payoff, so that a Nash equilibrium payoff need not
be a BCC equilibrium payoff.

Moreover, in zero-sum games it is not clear whether there is a unique BCC equi- 
librium payoff. If in zero-sum games there always is a unique BCC equilibrium payoff, 
then this quantity can be called the BCC value of the game. However, it is possible 
that in zero-sum games there will be more than one BCC equilibrium payoff, in which 
case even in this class of games, the outcome will crucially depend on the relative 
computational power the players have. 

2.6.3 A more general definition of a BCC equilibrium 

The definition of c-BCC equilibrium assumes that the utility of each player is additive, 
and that the memory cost is linear in the memory size. There are applications where 
the utility function Ui has a different form. 

• Players may disregard the memory cost, but be bounded by the size of memory
that they use.

This situation occurs, e.g., when players are willing to invest huge amounts of
money even if the profit is low, but the available technology does not allow
them to increase their memory size beyond some limit. Such a situation may
occur, e.g., in the area of code breaking, where countries invest large sums of
money to be able to increase the number of codes of other countries that they
break, and they are only bounded by technological advances.

• Memory is costly, yet players do not save money by reducing their memory
size. That is, a pair of mixed automata $(M_1, M_2)$ is a c-BCC equilibrium if for
each $i \in \{1,2\}$ and for every pure automaton $P_i$ with $|P_i| \le |M_i|$ one has
$\gamma_i(M_i, M_{3-i}) \ge \gamma_i(P_i, M_{3-i})$, and, if $|P_i| > |M_i|$, one has
$\gamma_i(M_i, M_{3-i}) \ge \gamma_i(P_i, M_{3-i}) - c(|P_i| - |M_i|)$.
This situation occurs, e.g., when the players are organizations whose size
cannot be reduced.

It may be of interest to study the set of equilibrium payoffs for various utility 
functions Ui, and to see whether and how this set depends on the shape of this 
function. 

2.6.4 More than two players 

The concept of BCC equilibrium payoff applies to games with any number of players.
However, Theorem 4 holds only for two-player games. One crucial point in our construction
is that if a deviation is detected, a player is punished for a long (yet finite)
period of time by a punishing action. When there are more than two players, the
punishing action of, say, player 1 against player 2 may be different from the punishing
action of player 1 against player 3. It is not clear how to construct an automaton that
can punish each of the other players, if necessary, and such that all these memory
cells will be used on the equilibrium path.

2.6.5 BCC equilibria in one-shot games 

The concept of BCC equilibrium that we presented here applies to repeated games.
However, the concept can be naturally adapted to one-shot games as well.^3 For
example, consider the following game, that appears in Halpern and Pass [7]. Player
1 chooses an integer n and tells it to player 2; player 2 has to decide whether n is a
prime number or not, winning 1 if she is correct, losing 1 if she is incorrect. Plainly
the value of this game is 1: player 2 can check whether the choice of player 1 is a
prime number. However, as checking whether a large integer is a prime number
requires substantial computation, it is not clear whether in practice risk-neutral people would be
willing to participate in this game as player 2.

^3 We thank Ehud Kalai for drawing our attention to this issue.

The concept of BCC equilibrium can be applied in such situations, and one can 
study the set of BCC equilibrium payoffs, and how this set depends on the relative 
memory cost of the two players. 

In the context of the Computer Science literature one could conceive of an analogous
solution concept, where automata are replaced by Turing machines, and the memory 
size is replaced by the length of the machine's tape. 



3 BCC Equilibria in the Prisoner's Dilemma 

In the present section we prove Theorem 4 for the Prisoner's Dilemma. The construction
in this case contains all the ingredients of the general case, yet the simplicity
of the Prisoner's Dilemma allows one to concentrate on the main aspects of the construction.
In Section 4 we indicate how to generalize this basic construction to general
two-player repeated games.

The Prisoner's Dilemma is the two-player game depicted in Figure 1, where each
player has two actions: $A_1 = A_2 = \{\mathrm{Cooperate}, \mathrm{Defect}\}$.

                     Player 2
                    D        C
Player 1     D     1,1      4,0
             C     0,4      3,3

Figure 1: The Prisoner's Dilemma.

The min-max level of each player is 1, and the punishing action of each player
is D. The set of feasible and (weakly) individually rational payoffs appears in Figure
2. It is equal to the quadrilateral $W$ with extreme points $(1,1)$, $(1, 3\frac{2}{3})$, $(3,3)$ and
$(3\frac{2}{3}, 1)$.

Figure 2: The feasible and individually rational payoffs in the Prisoner's Dilemma. 
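The two non-trivial extreme points of $W$ can be checked by intersecting the individual rationality constraints with the edges of the feasible set that join $(0,4)$ to $(3,3)$ and $(4,0)$ to $(3,3)$. The helper name `point_on_segment` below is an assumption introduced for this sketch.

```python
from fractions import Fraction


def point_on_segment(p, q, coord, value):
    """Point on the line through p and q whose coordinate `coord` equals `value`."""
    t = Fraction(value - p[coord], q[coord] - p[coord])
    return tuple(p[j] + t * (q[j] - p[j]) for j in range(2))


min_max = 1
# Edge from (0,4) to (3,3), cut where player 1 receives her min-max level.
top = point_on_segment((0, 4), (3, 3), coord=0, value=min_max)
# Edge from (4,0) to (3,3), cut where player 2 receives her min-max level.
right = point_on_segment((4, 0), (3, 3), coord=1, value=min_max)

print([str(c) for c in top])
print([str(c) for c in right])
```

Both intersection points have the coordinate $11/3 = 3\frac{2}{3}$, matching the extreme points listed above.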

We now show that every feasible and individually rational payoff vector $x$ is a
BCC equilibrium payoff. In the construction we do not use the special structure of
the payoff matrix; all we use is that each player has two actions, and that D is the
punishing action of both players.

Observe that each point in $W$ can be written as a convex combination of three
vectors in the payoff matrix, $(3,3)$, $(1,1)$, and either $(0,4)$ or $(4,0)$. Assume w.l.o.g.
that the latter holds, so that

$x = \alpha_1 (1,1) + \alpha_2 (4,0) + \alpha_3 (3,3)$,

where $\alpha_1 + \alpha_2 + \alpha_3 = 1$ and $\alpha_1, \alpha_2, \alpha_3 \ge 0$.
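The weights in this convex combination can be computed in closed form. Subtracting the second coordinate from the first gives $\alpha_2 = (x_1 - x_2)/4$, and combining $x_2 = \alpha_1 + 3\alpha_3$ with $\alpha_1 + \alpha_3 = 1 - \alpha_2$ gives $\alpha_3 = (x_2 - 1 + \alpha_2)/2$. The sketch below implements these formulas; the function name `convex_weights` is an assumption, and the formulas cover only the case $x_1 \ge x_2$ treated in the text.

```python
from fractions import Fraction


def convex_weights(x1, x2):
    """Weights (a1, a2, a3) with x = a1*(1,1) + a2*(4,0) + a3*(3,3) and a1 + a2 + a3 = 1."""
    x1, x2 = Fraction(x1), Fraction(x2)
    a2 = (x1 - x2) / 4          # from x1 - x2 = 4 * a2
    a3 = (x2 - 1 + a2) / 2      # from x2 = a1 + 3*a3 and a1 + a3 = 1 - a2
    a1 = 1 - a2 - a3
    return a1, a2, a3


a1, a2, a3 = convex_weights(3, 2)
print(a1, a2, a3)
# Reconstruct x from the weights as a consistency check.
print(a1 * 1 + a2 * 4 + a3 * 3, a1 * 1 + a2 * 0 + a3 * 3)
```

For a target payoff outside $W$ the same formulas still return numbers, but some weight becomes negative, so a caller should verify non-negativity.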

Our goal is to define two sequences of mixed automata $(M_1(k))_k$ and $(M_2(k))_k$
that support $x$ as a BCC equilibrium payoff: the long-run average payoff under
$(M_1(k), M_2(k))$ will converge to $x$. The road-map of the proof is as follows. We
fix $k \in \mathbb{N}$, and we define a play path that depends on $k$ and that will be the equilibrium
path under $(M_1(k), M_2(k))$ (Section 3.1). We then calculate a lower bound
on the complexity of the play path for each player (the complexity is of the order
$k^3$, see Section 3.2). Recall that the complexity of a play path w.r.t. a player is the
size of the smallest automaton for that player that can implement this play path,
provided the other player follows her part in the play path. We then construct, for
each player, a family of pure automata with this smallest size that implement the
play path (Sections 3.3 and 3.4). We let the mixed automaton of each player choose
one of these pure automata, and finally we prove that each of these mixed automata
is a $c(k)$-BCC best reply against the other, where $\lim_{k \to \infty} c(k) = 0$ (see Sections 3.5
and 3.6).



3.1 The Equilibrium Play 

We fix throughout a natural number $k$, sufficiently large to satisfy several conditions
that will be set in the sequel. Let A;o be the largest integer that satisfies (fco)^ + ^o < ^• 
We here define a specific play path $\omega^*$ that will be the equilibrium path.

We approximate (ai,a2,a3) by rational numbers with denominator fco; that is, 
let (k\, ks) be three natural numbers that satisfy (a) ki + k2 + = ko, and (b) 
|||^ — ttjil < for j = 1,2,3. Let A; be a sufficiently large integer such that there 
are at least $k_2$ prime numbers larger than $k_2$ and smaller than $k - k_1$. Because the
number of prime numbers smaller than $k$ is approximately $k/\ln k$, $k$ is of the order^4 of
$k_2 \ln(k_2)$.

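Let $\omega_0$ be the following play of length $k_0$ that generates a payoff close to $x$:

\[
\begin{aligned}
\omega_0 &= k_1 \times (D,D) + k_2 \times (D,C) + k_3 \times (C,C) && (2) \\
&= \underbrace{(D,D), \dots, (D,D)}_{k_1 \text{ times}},\ \underbrace{(D,C), \dots, (D,C)}_{k_2 \text{ times}},\ \underbrace{(C,C), \dots, (C,C)}_{k_3 \text{ times}}. && (3)
\end{aligned}
\]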

Here, the notation $n \times a$ means a repetition of $n$ times the action pair $a$, and $\omega_1 + \omega_2$
means the concatenation of $\omega_1$ and $\omega_2$. Because of the choice of $(k_1, k_2, k_3)$, the average
payoff along ujq is ;y=-close to x. 
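The play $\omega_0$ and its average payoff can be illustrated as follows. This is a sketch under stated assumptions: the function names `omega0` and `average_payoff` and the sample values $(k_1, k_2, k_3) = (1, 2, 5)$ are chosen here for illustration and do not come from the construction.

```python
from fractions import Fraction

PD = {("D", "D"): (1, 1), ("D", "C"): (4, 0), ("C", "D"): (0, 4), ("C", "C"): (3, 3)}


def omega0(k1, k2, k3):
    """k1 x (D,D) + k2 x (D,C) + k3 x (C,C), with + read as concatenation."""
    return [("D", "D")] * k1 + [("D", "C")] * k2 + [("C", "C")] * k3


def average_payoff(play):
    """Average stage payoff of both players along a finite play."""
    n = len(play)
    return tuple(Fraction(sum(PD[pair][i] for pair in play), n) for i in range(2))


play = omega0(1, 2, 5)
print(len(play))
print([str(v) for v in average_payoff(play)])
```

With these sample values the frequencies are exactly $(1/8, 1/4, 5/8)$, so the average payoff equals $\frac{1}{8}(1,1) + \frac{1}{4}(4,0) + \frac{5}{8}(3,3) = (3,2)$; in general the frequencies only approximate $(\alpha_1, \alpha_2, \alpha_3)$.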



Let $\omega^*$ be the play path that consists of the following three parts:



''in (b) we require that |||^ ^ < -^j^ rather than |||^ — aj || < to accommodate the case 
as = 0. If as = 0, with the latter requirement we would have e {0, 1}, and there would not be 
$k_2$ prime numbers between $k_2$ and $k - k_1$.

• A punishment phase that consists of $k^3$ times playing $(D,D)$.

• A babbling phase, that consists of $2k+1$ blocks: in odd blocks (except the last
one) the players play $k$ times $(C,C)$, in even blocks they play $k$ times $(D,D)$,
and in the last block the players play $k+1$ times $(C,C)$.

• A regular play, in which the players repeatedly play $\omega_0$.

Formally, the play path $\omega^*$ is:

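\[
\omega^* = \underbrace{k^3 \times (D,D)}_{\text{Punishment}} + \underbrace{\sum_{j=1}^{k} \bigl( k \times (C,C) + k \times (D,D) \bigr) + (k+1) \times (C,C)}_{\text{Babbling}} + \underbrace{\omega_0 + \omega_0 + \cdots}_{\text{Regular}}
\]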
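The finite part of the play path that precedes the regular play can be generated directly from the description of the phases. The sketch below assumes that the punishment phase lasts $k^3$ stages; the function name `play_prefix` is hypothetical.

```python
def play_prefix(k):
    """Punishment and babbling phases of the play path, as a list of action pairs."""
    punishment = [("D", "D")] * k**3             # assumed length k^3
    babbling = []
    for _ in range(k):                           # k pairs of (odd block, even block)
        babbling += [("C", "C")] * k + [("D", "D")] * k
    babbling += [("C", "C")] * (k + 1)           # last block of the babbling phase
    return punishment + babbling


k = 3
prefix = play_prefix(k)
print(len(prefix), k**3 + 2 * k**2 + k + 1)
```

Under this assumption the prefix has $k^3 + 2k^2 + k + 1$ stages, of which $2k^2 + k + 1$ belong to the babbling phase.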

The roles of the three phases are as follows.

• As in Abreu and Rubinstein [1], the punishment phase ensures that punishment
is on the equilibrium path. Because the players minimize their automaton size,
subject to maximizing their payoff, if the punishment phase were off the equilibrium
path, players could save states by not implementing it. But if a player
does not implement punishment, the other player may safely deviate, knowing
that she will not be punished. In our construction, detectable deviations of the
other player will lead the automaton to restart and re-implement $\omega^*$, thereby
initiating a long punishment phase. The punishment phase, of length $k^3$, is
much longer than the babbling phase to ensure that the punishment is severe.

• The importance of the babbling phase is that it allows us to build up the mixed
strategy equilibrium. To reach any equilibrium payoff in the convex hull, players
need to implement sequences of action pairs. Some of them could be played by
means of previously used states. Nevertheless, this construction may fail
because the opponent may deviate without being punished. To
avoid this weakness, the position of such re-used states must be concealed.
It is here that the use of the mixed strategy plays a decisive role: it hides
the chosen pure strategy. This set of pure strategies will be characterised by
the location of the re-used states within a convenient set of states. In our
construction this is implemented by the babbling phase.

The babbling phase serves two purposes. First, because it is coordinated,
it is not difficult to calculate its complexity. Second, it is sufficiently long, so that
to implement the regular phase one does not need new states, but rather one
can re-use states that implement the babbling phase. Moreover, its length
ensures that, if the states that are re-used are chosen randomly, to find which
states are re-used with non-negligible probability the other player must use a
very large automaton: to profit by deviating the other player needs to search
for the re-used states, a task that requires a significantly larger automaton than
the one she currently uses.

• On the equilibrium path the regular play will be played repeatedly, so that the
long-run average payoff will be the average payoff along $\omega_0$, which is close to $x$.
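The three phases can be made concrete with a short sketch that builds a finite prefix of the play path for the Prisoner's Dilemma. It assumes, as in the later subsections, that the regular play is $\omega_0 = k_1 \times (D,D) + k_2 \times (D,C) + k_3 \times (C,C)$; the function name `build_play_path` and the `periods` argument (how many repetitions of $\omega_0$ to unroll) are illustrative choices, not notation from the text.

```python
def build_play_path(k, k1, k2, k3, periods):
    """Return a finite prefix of the play path as a list of (a1, a2) action pairs.

    Assumption: omega_0 = k1 x (D,D) + k2 x (D,C) + k3 x (C,C).
    `periods` is a hypothetical parameter: the number of copies of omega_0 to unroll.
    """
    # Punishment phase: k^3 times (D, D).
    punishment = [("D", "D")] * k**3

    # Babbling phase: k pairs of blocks (a C-block then a D-block, each of length k),
    # followed by a last block of k + 1 times (C, C).
    babbling = []
    for _ in range(k):
        babbling += [("C", "C")] * k
        babbling += [("D", "D")] * k
    babbling += [("C", "C")] * (k + 1)

    # Regular play: repetitions of omega_0.
    omega0 = [("D", "D")] * k1 + [("D", "C")] * k2 + [("C", "C")] * k3
    return punishment + babbling + omega0 * periods


path = build_play_path(k=2, k1=1, k2=1, k3=1, periods=3)
# The punishment and babbling phases together take k^3 + 2k^2 + k + 1 stages.
print(len(path))
```

For $k = 2$ the punishment and babbling phases occupy the first $8 + 8 + 3 = 19$ stages, after which the path cycles through $\omega_0$.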

3.2 The complexity of $\omega^*$

Let $\omega$ be a (finite or infinite) sequence of action pairs. We say that a mixed automaton
$M_i$ of player $i$ is compatible with the play $\omega$ if, when the other player $3 - i$ plays her part
in $\omega$, the automaton generates the play of player $i$ in $\omega$ (with probability 1). Plainly,
different automata may be compatible with the same sequence $\omega$. The complexity of
$\omega$ w.r.t. player $i$ is the size of the smallest automaton of player $i$ that is compatible
with $\omega$. This concept was first defined and studied by Neyman [12], who also provided
a simple way to calculate it.

Our goal now is to calculate the complexity of $\omega^*$ w.r.t. the two players.

Lemma 5 The complexity of $\omega^*$ w.r.t. player 1 is $k^3 + 2k^2 + 1$, and its complexity
w.r.t. player 2 is $k^3 + 2k^2 + k + 1$.

In the rest of this subsection we prove that the complexity of $\omega^*$ w.r.t. each of the
players is at least the quantities given in Lemma 5. In the next two subsections we
provide an automaton for player 1 (resp. for player 2) with size $k^3 + 2k^2 + 1$ (resp.
$k^3 + 2k^2 + k + 1$) that is compatible with $\omega^*$, thereby completing the proof of Lemma 5.

We start by recalling Neyman's [12] characterization for the complexity of a play 
w.r.t. a player. 

Denote by $\omega_t$ the sequence $\omega$ after deleting the first $t - 1$ elements from the
sequence.† Given a sequence of action pairs $\omega$, finite or infinite, define an equivalence
relation on the set of natural numbers $\mathbb{N}$ as follows: $t$ is equivalent (for player $i$) to $t'$
if any automaton of player $i$ that is compatible with $\omega_t$ is also compatible‡ with $\omega_{t'}$.
Denote this equivalence relation by $\sim_{\omega,i}$. Neyman [12] proved that the complexity of
$\omega$ w.r.t. a player is the number of equivalence classes in this equivalence relation.

† If $\omega$ is a finite play, and $t$ is larger than the length of $\omega$, then $\omega_t$ is an empty sequence of action
pairs.

‡ In particular, the empty play is equivalent to any other play.

3.2.1 The complexity of $\omega^*$ w.r.t. player 1 is at least $k^3 + 2k^2 + 1$

The complexity of a sequence is at least the complexity of any of its subsequences
(Lemma 2 in Neyman [12]). To bound the complexity of $\omega^*$ w.r.t. player 1 we
calculate the complexity w.r.t. player 1 of the following prefix of $\omega^*$:

$$\omega^*(1) = k^3 \times (D,D) + \sum_{n=1}^{k} \bigl(k \times (C,C) + k \times (D,D)\bigr) + (k+1) \times (C,C).$$

In $\omega^*(1)$ the players play a coordinated play, i.e., there exists a one-to-one relationship
between the actions played by player 1 and the actions played by player 2: in every
stage either both players play $C$ or both players play $D$. Therefore, for every $t$, any
automaton of player 1 that is compatible with $\omega^*_t(1)$ can ignore the actions of player 2.
Consequently, an automaton of player 1 generates a deterministic sequence of actions.
This implies that if $t_1 < t_2$, and $t_1$ and $t_2$ are equivalent (w.r.t. $\sim_{\omega^*(1),1}$), then $\omega^*_{t_2}(1)$
is a prefix of $\omega^*_{t_1}(1)$.

Because a sequence of $k + 1$ times $C$ appears only at the end of the sequence $\omega^*(1)$,
it follows that $\omega^*_{t_2}(1)$ is not a prefix of $\omega^*_{t_1}(1)$ whenever $t_1 < t_2 \le k^3 + 2k^2 + 1$. In
particular, the complexity of $\omega^*$ to player 1 is at least $k^3 + 2k^2 + 1$.

3.2.2 The complexity of $\omega^*$ w.r.t. player 2 is at least $k^3 + 2k^2 + k + 1$

To bound the complexity of $\omega^*$ w.r.t. player 2, we calculate the complexity for player
2 of the following prefix $\omega^*(2)$ of $\omega^*$:

$$\omega^*(2) = k^3 \times (D,D) + \sum_{n=1}^{k} \bigl(k \times (C,C) + k \times (D,D)\bigr) + (k+1) \times (C,C) + k_1 \times (D,D) + 1 \times (D,C).$$

Apart from the last action pair, the play path $\omega^*(2)$ consists of a coordinated play.
Hence, analogously to the analysis for player 1, for every $t$, any automaton of player
2 that is compatible with $\omega^*_t(2)$ can ignore the actions of player 1. We now count the
number of equivalence classes of the relation $\sim_{\omega^*(2),2}$. The sequence $1 \times (C,C) + k_1 \times (D,D) + 1 \times (C,C)$
appears along $\omega^*(2)$ only after $k^3 + 2k^2 + k + 1$ stages in $\omega^*(2)$. It
follows that the number of equivalence classes of $\sim_{\omega^*(2),2}$ is at least $k^3 + 2k^2 + k + 1$.
In particular, the complexity of $\omega^*$ to player 2 is at least $k^3 + 2k^2 + k + 1$.
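The two lower bounds can be checked by brute force for small $k$. The sketch below relies on the observation used above: in a coordinated play an automaton can ignore its inputs, so its state path is a "lasso" of $n$ distinct states whose last state loops back to an earlier one. The function name `lasso_complexity` and the encoding of a player's own actions as a string are illustrative assumptions, not notation from the text.

```python
def lasso_complexity(actions):
    """Smallest number of states of an input-ignoring automaton whose output
    sequence starts with `actions` (a string such as "DDCC...").

    The automaton visits states 1..n in order and then jumps from state n back
    to state mu + 1; this is compatible with `actions` exactly when the part of
    the sequence after position n repeats what was produced from position mu on.
    """
    total = len(actions)
    for n in range(1, total + 1):
        tail = actions[n:]                 # what must be produced after n stages
        for mu in range(n):                # state n loops back to state mu + 1
            if actions[mu:mu + len(tail)] == tail:
                return n
    return total


k, k1 = 2, 1
babbling = ("C" * k + "D" * k) * k + "C" * (k + 1)
player1_prefix = "D" * k**3 + babbling                   # player 1's actions in omega*(1)
player2_prefix = "D" * k**3 + babbling + "D" * k1 + "C"  # player 2's actions in omega*(2)

print(lasso_complexity(player1_prefix))  # 17 == k^3 + 2k^2 + 1 for k = 2
print(lasso_complexity(player2_prefix))  # 19 == k^3 + 2k^2 + k + 1 for k = 2
```

In $\omega^*(2)$ only the final action pair is not coordinated, and no transition is needed after it, which is why ignoring inputs is harmless in this check.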

3.3 An automaton $M_1$ for player 1

In this section we define a family of pure automata for player 1, all of size
$k^3 + 2k^2 + 1$. Each automaton in the family is compatible with $\omega^*$. This will prove that
the complexity of $\omega^*$ w.r.t. player 1 is $k^3 + 2k^2 + 1$, as stated in Lemma 5. In Section
3.3.5 we define a mixed automaton for player 1 that is supported by pure automata
in this family and that will be part of the $d$-BCC equilibrium for a proper $d > 0$.

The automata in the family are parameterized by two parameters: an integer
$j \in \{1, 2, \ldots, k - 1\}$ and a set $H = \{h_1, h_2, \ldots, h_{k_2}\}$ of $k_2$ integers. The range of
$h_1, h_2, \ldots, h_{k_2}$ will be defined in step 3 below, where they are used.

Given a pair $(j, H)$ we proceed to construct a pure automaton $P_1^{j,H}$ for player 1.
For clarity of exposition, the construction is divided into three steps. We start in
step 1 by defining transitions that implement the prefix of length $k^3 + 2k^2 + 1$ of $\omega^*$.
In step 2 we add transitions that implement the next $k + k_1$ action pairs in $\omega^*$, and
in step 3 we add transitions that implement the rest of $\omega^*$. In step 1 we will use all
the states of $P_1^{j,H}$. In steps 2 and 3 we will re-use states for implementing the rest of
$\omega^*$. The mixed automaton that we will define later will choose $j$ and $H$ randomly, to
conceal the states that are re-used.

The size of the automaton $P_1^{j,H}$ that we construct is $k^3 + 2k^2 + 1$. Denote its states
by the integers $Q = \{1, 2, \ldots, k^3 + 2k^2 + 1\}$, where 1 is the initial state.

3.3.1 Step 1: Implementing the prefix of $\omega^*$ of length $k^3 + 2k^2 + 1$.

The prefix of length $k^3 + 2k^2 + 1$ of $\omega^*$ is:

$$\omega_1 = k^3 \times (D,D) + \sum_{n=1}^{k} \bigl(k \times (C,C) + k \times (D,D)\bigr) + (C,C).$$

This play consists of the punishment phase followed by $k$ pairs of blocks, each
pair made of a $C$-block and a $D$-block (both of length $k$). The length of $\omega_1$ is
equal to the size of the automaton, and therefore a naive implementation is to have
one state for each action of player 1 in $\omega_1$: state $q \in Q$ will implement the $q$'th action
pair in $\omega_1$. Formally, we divide $Q$ into three sets:

1. $Q^P = \{1, 2, \ldots, k^3\}$: this is the set of all states that implement the punishment
phase.

2. $Q^C = \bigcup_{n=0}^{k-1} \{k^3 + 2nk + 1, \ldots, k^3 + 2nk + k\} \cup \{k^3 + 2k^2 + 1\}$: this is the set of
states in all $C$-blocks.

3. $Q^D = \bigcup_{n=0}^{k-1} \{k^3 + 2nk + k + 1, \ldots, k^3 + 2nk + 2k\}$: this is the set of states in all
$D$-blocks.

The output function is:

$$f(q) = \begin{cases} D & q \in Q^P \cup Q^D, \\ C & q \in Q^C, \end{cases}$$

and the transition function is

$$g(q, f(q)) = q + 1, \qquad 1 \le q < k^3 + 2k^2 + 1.$$

Because the play in $\omega_1$ is coordinated, the transition is defined only if player 2 complies
with the desired play $\omega_1$. Figure 3 illustrates the first step in the construction of
the automaton $P_1^{j,H}$. In this figure, the initial state is the dotted circle to the left,
the white squares correspond to states where the action is $D$, and the black circles
correspond to states where the action is $C$.

Figure 3: An implementation of $\omega_1$ (from left to right: punishment phase, alternating $C$-blocks and $D$-blocks, final $C$ state).

3.3.2 Step 2: Implementing the next $k + k_1$ action pairs.

We now add to the automaton $P_1^{j,H}$ transitions that implement the next $k + k_1$ action
pairs in $\omega^*$, which are

$$\omega_2 = k \times (C,C) + k_1 \times (D,D).$$

Here we use the parameter $j$. Because (a) the play $\omega_2$ starts with $k \times (C,C)$, and (b)
each $C$-block has length $k$ and is followed by a $D$-block whose length is more than $k_1$,
we can use the $j$'th $C$-block and the following $D$-block to implement $\omega_2$. Therefore,
to implement $\omega_2$ it is sufficient to add one transition to $P_1^{j,H}$, from the last state to
the beginning of the $j$'th $C$-block:

$$g(k^3 + 2k^2 + 1, C) = k^3 + 2(j-1)k + 1.$$

Figure 4 illustrates the automaton $P_1^{j,H}$ with this additional transition.
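Steps 1 and 2 can be sketched in code and checked by simulation. The following is a minimal sketch, not the full automaton: it covers only the transitions of steps 1 and 2 (no re-used states from step 3 and no restart on deviation), and the names `build_steps_1_2` and `run` are hypothetical. Outputs and transitions are stored in dictionaries keyed by state and by (state, opponent action).

```python
def build_steps_1_2(k, j):
    """Output map and transitions of player 1's automaton after steps 1 and 2.

    States are 1..k^3 + 2k^2 + 1. Only the transitions defined in steps 1 and 2
    are included; any other (state, input) pair is left undefined.
    """
    n_states = k**3 + 2 * k**2 + 1
    out = {}
    for q in range(1, n_states + 1):
        if q <= k**3:
            out[q] = "D"                              # punishment states
        elif q == n_states:
            out[q] = "C"                              # final single C state
        else:
            offset = (q - k**3 - 1) % (2 * k)         # position inside a C/D pair of blocks
            out[q] = "C" if offset < k else "D"
    # Step 1: advance one state as long as player 2 mirrors the output.
    trans = {(q, out[q]): q + 1 for q in range(1, n_states)}
    # Step 2: from the last state, return to the start of the j'th C-block.
    trans[(n_states, "C")] = k**3 + 2 * (j - 1) * k + 1
    return out, trans


def run(out, trans, opponent_actions, start=1):
    """Actions played by the automaton against a fixed sequence of opponent actions."""
    q, played = start, []
    for a2 in opponent_actions:
        played.append(out[q])
        q = trans[(q, a2)]                            # KeyError on an undefined transition
    return "".join(played)


k, k1 = 3, 2
# Player 2's actions along omega_1 followed by omega_2 (a coordinated play).
target = "D" * k**3 + ("C" * k + "D" * k) * k + "C" + "C" * k + "D" * k1
for j in range(1, k):
    out, trans = build_steps_1_2(k, j)
    assert run(out, trans, target) == target
```

Every choice of $j$ produces the same play along $\omega_1$ and $\omega_2$, which is what allows $j$ to be randomized later without changing the equilibrium path.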



Figure 4: The automaton $P_1^{j,H}$ after the second step (punishment phase, alternating $C$-blocks and $D$-blocks, final $C$ state).

3.3.3 Step 3: Implementing the rest of $\omega^*$.

We now add to the automaton $P_1^{j,H}$ transitions that implement the next $k_2 + k_3$ action
pairs in $\omega^*$, which are

$$\omega_3 = k_2 \times (D,C) + k_3 \times (C,C),$$

and continue to implement the regular play, which is a periodic repetition of $\omega_0$.

Here we use the parameter set $H$. To implement the $k_2$ repetitions of $(D, C)$ we
re-use states in a $D$-block, whose identity is determined by the set $H$. Thus, whenever
in a re-used state, if player 2 plays $D$, the automaton $P_1^{j,H}$ assumes that the play is
in the babbling phase, whereas if player 2 plays $C$, the automaton assumes that $\omega_3$ is
implemented. Because $\omega_3$ comes after a sequence $k_1 \times (D,D)$, the first re-used state
must be the $(k_1 + 1)$'th state in the $j$'th $D$-block. Because after the sequence $k_3 \times (C,C)$ the
play continues with the next repetition of $\omega_0$, namely, with $k_1 \times (D,D)$, the sequence
$k_3 \times (C,C)$ will be implemented at the end of the $j$'th $C$-block.

Formally, assume that the set $H$ satisfies the following two conditions:

(D1) $h_1 = k^3 + 2(j-1)k + k + k_1 + 1$, and

(D2) $h_2, h_3, \ldots, h_{k_2}$ are distinct states in $Q^D$ (the set of states in $D$-blocks), all different from $h_1$.

We add the following transitions (see Figure 5):

$$g(h_n, C) = h_{n+1}, \qquad 1 \le n \le k_2 - 1, \tag{4}$$
$$g(h_{k_2}, C) = k^3 + 2(j-1)k + (k - k_3). \tag{5}$$

In Figure 5, re-used states are denoted by triangles. When the automaton $P_1^{j,H}$ is at
such a state it plays the action $D$; if player 2 plays the action $D$, the transition is
to the subsequent (square) state, whereas if player 2 plays $C$, the transition is to the
next triangle state.

Figure 5: The $j$'th $C$-block and $D$-block in $P_1^{j,H}$ (segments labelled $k_3 \times (C,C)$, $k_1 \times (D,D)$, and $k_2 \times (D,C)$).

3.3.4 Last step: Deviations.

By construction, the automaton $P_1^{j,H}$ is compatible with $\omega^*$. In particular, the
complexity of $\omega^*$ w.r.t. player 1 is $k^3 + 2k^2 + 1$, as stated in Lemma 5. We now add
transitions to detect deviations of player 2 as follows: all transitions that were not
defined in steps 1-3 lead to state 1.

Only re-used states accept both actions of player 2; the other states accept only
the action that is indicated by $\omega^*$. Because a punishment phase of length $k^3$ begins in
state 1, any deviation in a non re-used state is followed by a severe punishment. In the
next subsection we define the mixed automaton that player 1 uses. The parameters
$j$ and $H$ will be chosen randomly, so that to profit by deviation, player 2 will have to
learn $j$ or $H$, and such a learning process requires a large memory.

3.3.5 Mixed strategy

We now define the mixed automaton $M_1 = M_1(k)$ for player 1. For every $n$, $1 \le n \le k_2$,
define

$$\widehat{H}_n = \{1, 1 + n, 1 + 2n, 1 + 3n, \ldots, 1 + k_2 n\},$$

and

$$H_n = \{k^3 + 2(n-1)k + k + k_1 + h : h \in \widehat{H}_n\}.$$

Thus, $H_n$ contains $k_2$ states in the $n$'th $D$-block that are equally spaced, and the
distance between each two adjacent states is $n$. Because $k_0 + (k_0)^2 < k$, there are enough
states in the $D$-block to accommodate this construction, and the two conditions (D1)
and (D2) (in Section 3.3.3) are satisfied.

Let $J = \{j_1, j_2, \ldots, j_{k_2}\}$ be a collection of $k_2$ different prime numbers in the range
$\{k_2 + 1, k_2 + 2, \ldots, k - k_1\}$, which exist by the choice of $k$. Let $M_1$ be the mixed
automaton of player 1 that assigns a probability $\frac{1}{k_2}$ to each of the pure automata
$P_1^{j, H_j}$, $j \in J$.

3.4 An automaton $M_2$ for player 2

In this section we describe a construction analogous to the one we presented in Section
3.3, for a mixed automaton of player 2. We construct a family of pure automata
for player 2, all of size $k^3 + 2k^2 + k + 1$, and all compatible with $\omega^*$ for player 2.
As for player 1, the automata in the family depend on two parameters, an integer
$j \in \{1, 2, \ldots, k - 1\}$ and a set $H$ of integers.

3.4.1 Step 1: Implementing the prefix of length $k^3 + 2k^2 + k + 1$ of $\omega^*$.

We start by implementing the prefix of length $k^3 + 2k^2 + k + 1$ of $\omega^*$ by a naive
automaton with $k^3 + 2k^2 + k + 1$ states. The prefix is:

$$\omega_4 = k^3 \times (D,D) + \sum_{n=0}^{k-1} \bigl(k \times (C,C) + k \times (D,D)\bigr) + (k+1) \times (C,C),$$

and it contains the punishment phase and the babbling phase. As for player 1,
we define an automaton that implements each action pair in one state. Let
$Q = \{1, 2, \ldots, k^3 + 2k^2 + k + 1\}$ be the set of states of the automaton, and divide $Q$ into
three sets, as follows:

1. $Q^P = \{1, 2, \ldots, k^3\}$: this is the set of all states that implement the punishment
phase.

2. $Q^C = \bigcup_{n=0}^{k-1} \{k^3 + 2nk + 1, \ldots, k^3 + 2nk + k\} \cup \{k^3 + 2k^2 + 1, \ldots, k^3 + 2k^2 + k + 1\}$:
this is the set of all states in $C$-blocks.

3. $Q^D = \bigcup_{n=0}^{k-1} \{k^3 + 2nk + k + 1, \ldots, k^3 + 2nk + 2k\}$: this is the set of all states in
$D$-blocks.

The output function is:

$$f(q) = \begin{cases} D & q \in Q^P \cup Q^D, \\ C & q \in Q^C, \end{cases}$$

and the transition function is

$$g(q, f(q)) = q + 1, \qquad 1 \le q < k^3 + 2k^2 + k + 1.$$

3.4.2 Step 2: Implementing the next $k_1$ actions in $\omega^*$.

We now add the transitions that implement the next $k_1$ actions in $\omega^*$, which are
$\omega_5 = k_1 \times (D,D)$. To this end, we re-use states in a $D$-block, and because after $\omega_5$
player 2 plays $C$ in $\omega^*$, we re-use the last $k_1$ states that implement a $D$-block. So that
player 1 does not know which $D$-block is re-used, we use the $j$'th $D$-block. Formally,

$$g(k^3 + 2k^2 + k + 1, C) = k^3 + 2kj - k_1.$$

3.4.3 Step 3: Implementing the rest of $\omega^*$.

We now add transitions that implement $\omega_6 = k_2 \times (D,C) + k_3 \times (C,C) + \sum_{n=1}^{\infty} \omega_0$. To
implement the sequence $k_2 \times (D,C)$ we re-use states in a $C$-block that are determined
by the set $H = \{h_1, h_2, \ldots, h_{k_2}\}$. The first re-used state must be the first state in the
$(j+1)$'th $C$-block, and therefore $h_1 = k^3 + 2kj + 1$. Because the second part of $\omega_6$,
that is, $k_3 \times (C,C)$, should lead to the sequence $k_1 \times (D,D)$ that starts $\omega_0$, the states
that implement that part must be the last states in $Q$; therefore we must have
$h_{k_2} = k^3 + 2k^2 + k + 1 - k_3$. Finally, we require that $h_1, h_2, h_3, \ldots, h_{k_2}$ are distinct
states in $C$-blocks.

Transitions are defined as follows:

$$g(h_n, D) = h_{n+1}, \qquad 1 \le n < k_2, \tag{6}$$
$$g(h_{k_2}, D) = h_{k_2} + 1. \tag{7}$$

3.4.4 Last step: Deviations. 

Finally we add transitions to handle deviations in states that are not re-used. All
transitions that are not defined in steps 1-3 lead to state 1, so that such deviations
initiate a long punishment phase.

3.4.5 Mixed strategy of player 2.

The definition of the mixed strategy $M_2 = M_2(k)$ is analogous to that of $M_1$. Recall that

$$\widehat{H}_n = \{1, 1 + n, 1 + 2n, 1 + 3n, \ldots, 1 + k_2 n\},$$

and

$$H_n = \{k^3 + 2nk + k + k_1 + h : h \in \widehat{H}_n\},$$

and that $J$ is a set of $k_2$ distinct prime numbers in the range $\{k_2 + 1, \ldots, k - k_1\}$.
Let $M_2$ be the mixed automaton of player 2 that assigns a probability $\frac{1}{k_2}$ to each
of the pure automata $P_2^{j, H_j}$, $j \in J$.

In the following subsections we show that the sequence $(M_1(k), M_2(k))_k$ supports
$x$ as a BCC equilibrium payoff. That is, we show that (1) the expected long-run
average payoff under $(M_1, M_2)$ is close to $x$, (2) no player can profit by deviating
to a smaller automaton, and (3) we bound the amount a player can profit by deviating
to a larger automaton.

3.5 The expected payoff under $(M_1, M_2)$ is close to $x$

By construction, the automaton $M_1$ (resp. $M_2$) is compatible with the play $\omega^*$ for
player 1 (resp. player 2). Therefore, if the players use these automata the play is $\omega^*$,
and the long-run average payoff is close to $x$.
Define

$$x^* := \gamma(M_1, M_2) = \frac{k_1}{k_1 + k_2 + k_3}\, u(D,D) + \frac{k_2}{k_1 + k_2 + k_3}\, u(D,C) + \frac{k_3}{k_1 + k_2 + k_3}\, u(C,C).$$

3.6 $(M_1, M_2)$ is a $c$-BCC Equilibrium

In this section we prove that (Mi, M2) is a c-BCC-equilibrium, for every c that satisfies 
fexF' ^ "-^ ^ 2fe3- prove the claims for player 2. The claims for player 1 can 

be proven analogously. Below we denote the state of an automaton of player i at 
stage t by qi{t). 

For $l \in \{1, 2, \ldots, k_2\}$ denote $P_1^l := P_1^{j_l, H_{j_l}}$, so that the support of $M_1$ is $P_1^1, \ldots, P_1^{k_2}$.
Let $j^l$ and $H^l$ be the parameters $j$ and $H$ of $P_1^l$, for $l = 1, 2, \ldots, k_2$. Let $P_2$ be an
arbitrary pure automaton that implements a strategy of player 2. We denote by $\omega^l$
the play that is generated under $(P_1^l, P_2)$.

Suppose that the players use the automata $(P_1^l, P_2)$. If $P_2$ is not compatible with
$\omega^*$ for player 2, then $P_1^l$ restarts whenever a deviation from $\omega^*$ is detected, and a
punishment phase starts. Denote by $t_n^l$ the stage at the $n$'th time in which $P_1^l$ visits
state 1 when facing $P_2$:

$$t_1^l := 1, \qquad t_{n+1}^l := \min\{t > t_n^l : q_1(t) = 1\}, \quad n \ge 1.$$

By convention, the minimum of an empty set is $\infty$.

There are two scenarios where player 2 may improve her long-run average payoff.
One possibility is if there exists $n$ such that $t_n^l < \infty = t_{n+1}^l$. Then $t_n^l$ is the last stage
in which the automaton $P_1^l$ restarts; in other words, this is the last stage in which a
punishment phase starts. If the play after stage $t_n^l$ is different than $\omega^*$, it means that
player 2 plays as if she knows $j^l$ and/or $H^l$, and she might use this information to
improve her payoff. Another possibility is that $(t_n^l)_{n \in \mathbb{N}}$ are finite and between two of
these stages the average payoff of player 2 is higher than $x_2^*$ (in fact, if $(t_n^l)_{n \in \mathbb{N}}$ are
finite then, so that player 2 improves her payoff, the average payoff between $t_n^l$ and
$t_{n+1}^l - 1$ should be higher than $x_2^*$ infinitely often).

This leads us to the following definition. 

Definition 6 The automaton $P_2$ fools the automaton $P_1^l$ if either one of the following
conditions holds:

C1) There is $n_0 \in \mathbb{N}$ such that $t^l_{n_0} < \infty = t^l_{n_0+1}$ and $\omega^l_{t^l_{n_0}} \ne \omega^*$.

C2) $t^l_n < \infty$ for every $n \in \mathbb{N}$, and there is $n_0 \in \mathbb{N}$ such that the average payoff for
player 2 between stages $t^l_{n_0}$ and $t^l_{n_0+1} - 1$ is strictly higher† than $x_2^*$.

If condition C1 holds, we say that $P_2$ fools $P_1^l$ in stages $\{t^l_{n_0}, t^l_{n_0} + 1, \ldots\}$. If
condition C2 holds, we say that $P_2$ fools $P_1^l$ in stages $\{t^l_{n_0}, t^l_{n_0} + 1, \ldots, t^l_{n_0+1} - 1\}$. In
both cases we set $t^l = t^l_{n_0}$, and we say that at stage $t^l$ player 2 starts to fool $P_1^l$.
Denote by $R_l = \{q_2(t^l), q_2(t^l + 1), \ldots, q_2(t^l + k^3 - 1)\}$ the states that $P_2$ visits at
the beginning of the period in which it fools $P_1^l$. We will prove below that the sets
$(R_l)_{l=1}^{k_2}$ are disjoint, thereby bounding from below the size of any automaton of player
2 that obtains high payoff when facing $M_1$.

Neither C1 nor C2 implies that the long-run average payoff under $(P_1^l, P_2)$ is higher
than $x_2^*$. Yet, as the next lemma shows, the converse is true: if the long-run average
payoff of player 2 under $(P_1^l, P_2)$ exceeds $x_2^*$, then $P_2$ must have fooled $P_1^l$.

Lemma 7 If $P_2$ does not fool $P_1^l$ then $\gamma_2(P_1^l, P_2) \le x_2^*$.

Proof. Since both $P_1^l$ and $P_2$ are automata, the long-run average payoff of player
2 under $(P_1^l, P_2)$ exists. Suppose first that $t_n^l < \infty$ for every $n \in \mathbb{N}$. Because $P_2$ does
not fool $P_1^l$, for every $n \in \mathbb{N}$ the average payoff of player 2 between stages $t_n^l$ and $t_{n+1}^l$
is at most $x_2^*$, and therefore $\gamma_2(P_1^l, P_2) \le x_2^*$.

Suppose now that there is $n_0 \in \mathbb{N}$ such that $t^l_{n_0} < \infty = t^l_{n_0+1}$. Because $P_2$ does
not fool $P_1^l$, we have $\omega^l_{t^l_{n_0}} = \omega^*$, so that the long-run average payoff of player 2 after
stage $t^l_{n_0}$ is $x_2^*$, and the result follows. ■

Our goal is to relate the number of pure automata $P_1^l$ that $P_2$ fools to the size of
$P_2$. In fact, we will prove that the size of $P_2$ is at least $k^3$ times the number of pure
automata $P_1^l$ that $P_2$ fools. To this end, we now prove that if $P_2$ fools both $P_1^{l_1}$ and
$P_1^{l_2}$, then $R_{l_1}$ and $R_{l_2}$ are disjoint: the automaton of player 2 uses different states to
fool each of the two automata.

Lemma 8 Let $1 \le l_1 < l_2 \le k_2$. If $P_2$ fools both $P_1^{l_1}$ and $P_1^{l_2}$ then $R_{l_1} \cap R_{l_2} = \emptyset$.

The subtle definition of $(j_l, H_l)_{l=1}^{k_2}$ is the key ingredient in the proof of Lemma 8.
An immediate corollary of Lemma 8 is:

Corollary 9 Denote by $L_0$ the number of pure automata $P_1^l$ that $P_2$ fools. Then
$|P_2| \ge L_0 k^3$.

† Observe that in this case $t^l_{n_0+1} > t^l_{n_0} + k^3$. In fact, a stronger bound can be obtained.

Proof of Lemma 8.

Step 1: If $P_2$ fools $P_1^l$ then the states in $R_l$ are distinct: $|R_l| = k^3$.

At stage $t^l$ the automaton $P_1^l$ restarts; it expects the sequence $k^3 \times (D,D) + k \times (C,C)$,
and none of the states $\{1, 2, \ldots, k^3 + 1\}$ of $P_1^l$ is re-used. Because this play
is coordinated, its complexity is $k^3 + 1$, and therefore player 2 must use at least $k^3$
distinct states to implement its prefix of length $k^3$.

Step 2: If $R_{l_1}$ and $R_{l_2}$ are not disjoint, then $q_2(t^{l_1} + k^3 - 1) = q_2(t^{l_2} + k^3 - 1)$: the
last state in $R_{l_1}$ coincides with the last state in $R_{l_2}$.

Suppose that $R_{l_1}$ and $R_{l_2}$ are not disjoint, and assume that $q_2(t^{l_1} + n_1) = q_2(t^{l_2} + n_2)$.
We argue that necessarily $n_1 = n_2$. Indeed, assume to the contrary that $n_1 < n_2$.
Because in the $k^3$ stages that follow stage $t^{l_1}$ the automaton $P_1^{l_1}$ plays $D$, and in the
$k^3$ stages that follow stage $t^{l_2}$ the automaton $P_1^{l_2}$ plays $D$, the automaton $P_2$ receives
the same inputs (when facing $P_1^{l_1}$ after stage $t^{l_1}$ and when facing $P_1^{l_2}$ after stage $t^{l_2}$),
so that it evolves in the same way: $q_2(t^{l_1} + n_1 + s) = q_2(t^{l_2} + n_2 + s)$ for every $s$
that satisfies $1 \le s \le k^3 - n_2$. Because $P_2$ fools $P_1^{l_1}$, the action $P_2$ plays in state
$q_2(t^{l_1} + n_1 + k^3 - n_2 + 1)$ is $D$. Because $P_2$ fools $P_1^{l_2}$, the action $P_2$ plays in state
$q_2(t^{l_2} + n_2 + k^3 - n_2 + 1)$ is $C$. But $q_2(t^{l_1} + n_1 + k^3 - n_2 + 1) = q_2(t^{l_2} + n_2 + k^3 - n_2 + 1)$,
a contradiction.

Because in the first $k^3$ stages after visiting state 1, both $P_1^{l_1}$ and $P_1^{l_2}$ play in the
same manner (both output $D$), it follows that the evolution of $P_2$ when facing either
$P_1^{l_1}$ or $P_1^{l_2}$ is the same. The claim follows.

Step 3: $R_{l_1} \cap R_{l_2} = \emptyset$.

Assume to the contrary that $R_{l_1}$ and $R_{l_2}$ are not disjoint. Denote by $(j_1, H_1)$ and
$(j_2, H_2)$ the parameters $(j, H)$ of $P_1^{l_1}$ and $P_1^{l_2}$ respectively. By Step 2, the last state
in $R_{l_1}$ coincides with the last state in $R_{l_2}$. Both automata $P_1^{l_1}$ and $P_1^{l_2}$ continue in
the same way, until one of them observes a deviation, in which case it restarts.

Denote by $t^{l_1} + n$ the first stage after stage $t^{l_1}$ in which $P_2$ deviates from $\omega^*$ when
facing $P_1^{l_1}$. Because $R_{l_1} = R_{l_2}$, the first stage after stage $t^{l_2}$ in which $P_2$ deviates from
$\omega^*$ when facing $P_1^{l_2}$ is $t^{l_2} + n$. Because $P_2$ fools both $P_1^{l_1}$ and $P_1^{l_2}$, the state that $P_1^{l_1}$
visits in stage $t^{l_1} + n$ is a re-used state, as is the state that $P_1^{l_2}$ visits in stage $t^{l_2} + n$.
Because the re-used states in $P_1^{l_1}$ are in the $j_1$'th $D$-block, while the re-used states
in $P_1^{l_2}$ are in the $j_2$'th $D$-block, the automata $P_1^{l_1}$ and $P_1^{l_2}$ are both in re-used states
only when they implement the action pairs $(D, C)$, that is, in the second part of the
regular play $\omega_0$.

Let us now verify that $P_2$ cannot fool both $P_1^{l_1}$ and $P_1^{l_2}$. Because in a $D$-block
both automata $P_1^{l_1}$ and $P_1^{l_2}$ play $D$ unless a deviation is detected and a punishment
phase starts, the evolution of $P_2$, when facing either $P_1^{l_1}$ or $P_1^{l_2}$, is the same, as long as
these automata are in the $D$-block. It is therefore sufficient to show that there is no
sequence of actions of player 2 that differs from the play of $\omega^*$ in $D$-blocks, and that
does not initiate a punishment phase when facing either $P_1^{l_1}$ or $P_1^{l_2}$.

Because $H_1$ (resp. $H_2$) contains $k_2$ numbers, equally spaced with distance $j_1$
(resp. $j_2$), the difference $h_{r_1} - h_{r_2}$ of pairs of elements in $H_1$ (resp. $H_2$) is a multiple
of $j_1$ (resp. $j_2$). Because $j_1$ and $j_2$ are prime numbers larger than $k_2$, the differences
generated by $H_1$ are different than those generated by $H_2$. It follows that the only
two sequences of actions of player 2 that initiate a punishment phase neither
when facing $P_1^{l_1}$ in block $j_1$ nor when facing $P_1^{l_2}$ in block $j_2$ are (a) a repetition of $k_2$
times $C$, and (b) a repetition of $k - k_1$ times $D$. Because $P_2$ deviates from $\omega^*$, only
the sequence in (b) should be considered.

Now, in all blocks after block $j_1$ (resp. $j_2$), the automaton $P_1^{l_1}$ (resp. $P_1^{l_2}$) does
not re-use states. Because $P_2$ fools both $P_1^{l_1}$ and $P_1^{l_2}$, it must follow the play indicated
by these automata. However, because $j_1 \ne j_2$, when the first of these two automata
reaches its last state, that automaton initiates a punishment phase if $P_2$ plays $D$,
while the other initiates a punishment phase if $P_2$ plays $C$. This implies that if $P_2$
plays the sequence in (b), then it cannot fool both $P_1^{l_1}$ and $P_1^{l_2}$, as desired. ■
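The arithmetic fact behind Step 3 is that two sets of $k_2$ equally spaced points, with spacings given by distinct primes larger than $k_2$, share no pairwise difference. A minimal sketch of this check follows; the helper names `primes_in_range`, `spaced_offsets` and `pairwise_differences` are hypothetical, and the offsets are taken relative to the start of the re-used region rather than as absolute state numbers.

```python
def primes_in_range(lo, hi):
    """Primes p with lo <= p <= hi, by trial division (fine for small ranges)."""
    return [p for p in range(max(lo, 2), hi + 1)
            if all(p % d for d in range(2, int(p**0.5) + 1))]


def spaced_offsets(step, count):
    """`count` equally spaced offsets 1, 1 + step, 1 + 2*step, ..."""
    return [1 + i * step for i in range(count)]


def pairwise_differences(values):
    """All positive differences between two elements of `values`."""
    return {b - a for a in values for b in values if b > a}


k, k1, k2 = 12, 1, 3
J = primes_in_range(k2 + 1, k - k1)[:k2]          # candidate spacings j_1, ..., j_{k2}
diffs = {j: pairwise_differences(spaced_offsets(j, k2)) for j in J}

# Distinct primes larger than k2 never generate a common difference:
# m * j1 == m' * j2 with 1 <= m, m' < k2 is impossible.
for a in J:
    for b in J:
        if a != b:
            assert diffs[a].isdisjoint(diffs[b])

# With non-prime spacings the property can fail, e.g. spacings 2 and 4.
assert pairwise_differences(spaced_offsets(2, 3)) & pairwise_differences(spaced_offsets(4, 3))
```

This is why a single sequence of deviations by player 2 cannot land on re-used states of two different pure automata in the support of $M_1$.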

Recall that the min-max value in pure strategies of both players is 1. Therefore,
$\min\{x_1^* - 1, x_2^* - 1\} > 0$ is the minimal difference between the target payoff $x^*$ and
the min-max value. We now prove that player 2 cannot profit by deviating to an
automaton smaller than $M_2$.

Lemma 10 Let r] < x^ — l, and assume that k is sufficiently large so that ^ + 1 < f ■ 
Let P2 be an automaton for player 2 with size smaller than k^ + 2/c^ + A; + 1. Then 
72(Mi,P^) -c|P^| < 72(Mi,M2) -CIM2I, provided c< ^. 

Proof. Because the complexity of $\omega^*$ w.r.t. player 2 is $k^3 + 2k^2 + k + 1$, the play
under $(P_1^l, P_2)$ is not $\omega^*$. By Lemma 8 and because the size of $P_2$ is smaller than
$2k^3$, the automaton $P_2$ can fool at most one of the automata $(P_1^l)$. Because it cannot
generate $\omega^*$, any automaton $P_1^l$ which $P_2$ does not fool restarts after at most $k^3 + 2k^2 + k$
stages, so that the average payoff is at most
$\frac{k^3}{k^3 + 2k^2 + k} + 4\,\frac{2k^2 + k}{k^3 + 2k^2 + k}$. It follows that the
expected payoff $\gamma_2(M_1, P_2)$ is at most

$$4 \cdot \frac{1}{k_2} + 4 \cdot \frac{k_2 - 1}{k_2} \cdot \frac{2k^2 + k}{k^3 + 2k^2 + k} + \frac{k_2 - 1}{k_2} \cdot \frac{k^3}{k^3 + 2k^2 + k} \le 1 + \frac{4}{k_2} + \frac{8}{k}.$$
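The last inequality can be checked numerically with exact rational arithmetic. The sketch below assumes the bound has the form just described: payoff at most 4 against the single automaton that may be fooled (probability $1/k_2$), and against each of the others a payoff of 1 during the $k^3$ punishment stages and at most 4 during the remaining $2k^2 + k$ stages of a cycle. The function names are hypothetical.

```python
from fractions import Fraction


def expected_payoff_bound(k, k2):
    """Upper bound on player 2's expected payoff with a smaller automaton (assumed form)."""
    cycle = k**3 + 2 * k**2 + k
    fooled = Fraction(4, k2)                               # at most one automaton is fooled
    per_cycle = Fraction(k**3, cycle) + 4 * Fraction(2 * k**2 + k, cycle)
    return fooled + Fraction(k2 - 1, k2) * per_cycle


def simple_bound(k, k2):
    """The simplified bound 1 + 4/k2 + 8/k."""
    return 1 + Fraction(4, k2) + Fraction(8, k)


# The simplified bound dominates for every pair tried here.
assert all(expected_payoff_bound(k, k2) <= simple_bound(k, k2)
           for k in range(2, 60) for k2 in range(1, k + 1))
```

The check rests on $\frac{4(2k^2 + k)}{k^3 + 2k^2 + k} = \frac{8k + 4}{(k+1)^2} \le \frac{8}{k}$, which holds for every $k \ge 1$.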

Because the size of the automaton $M_2$ is $k^3 + 2k^2 + k + 1$, the gain from reducing the
size of the automaton from $|M_2|$ to $|P_2|$ is at most $c(k^3 + 2k^2 + k)$. So that player 2 does
not profit by this deviation, we need to require that

$$x_2^* \ge 1 + \frac{4}{k_2} + \frac{8}{k} + c(k^3 + 2k^2 + k),$$

and therefore it is enough to require that

$$x_2^* - 1 > \eta \ge \frac{4}{k_2} + \frac{8}{k} + c(k^3 + 2k^2 + k).$$

The right-hand side inequality holds provided

$$c \le \frac{\eta - \frac{4}{k_2} - \frac{8}{k}}{k^3 + 2k^2 + k},$$



so it is enough to require that c < m 

We finally prove that player 2 cannot profit by deviating to an automaton larger 
than M2. 

Lemma 11 Let P2 be a pure automaton such that 'j2{Mi{k) , P2) > Then'y2{Mi, P2) — 
c\P^\ <72(Mi,M2)-c|M2|, provided c>^^. 

Proof. Let $L_0$ be the number of pure automata $(P_1^l)$ that $P_2$ fools. Because
$\gamma_2(M_1, P_2) > x_2^*$ we have $L_0 \ge 1$. If $P_2$ fools $P_1^l$, player 2's long-run average payoff is
at most 4, the maximal payoff in the game. If $P_2$ does not fool $P_1^l$, player 2's long-run
average payoff is at most $x_2^*$. The expected long-run average payoff of player 2 then
satisfies

$$\gamma_2(M_1, P_2) \le 4\,\frac{L_0}{k_2} + x_2^*\,\frac{k_2 - L_0}{k_2} \le x_2^* + 3\,\frac{L_0}{k_2}.$$

By Corollary 9 we have $|P_2| \ge L_0 k^3$, and therefore

Therefore, as soon as c > player 2 does not profit by this deviation. ■ 



To summarize, given a feasible and individually rational payoff vector $x^*$, we
choose rj G (0,min{a;* — 1,^2 — 1}). Let c > be sufficiently small, and let k = kc 
satisfy < c < gp-. Then the automata {Mi{k), M2{k)) form a c-BCC equilibrium. 
Since the size of the automata Mi(k) and M2{k) are k^ + 2k'^ + 1 and k^ + 2k'^ + k + l, 
if for each A; > 1 we set Ck = then ^ < ^k < ^ and CkMi{k) and CkM2{k) 
are both smaller than which goes to as A; goes to infinity (and goes to 0). It 
follows that X* is a BCC equilibrium payoff. 



4 The General Case 

In Section 3 we proved the folk theorem for the Prisoner's Dilemma. In the present section
we explain how the proof should be adapted to prove the result for arbitrary games.
In the play path $\omega^*$, the punishment phase, as well as the regular play, are similar to
those in Section 3, and only the babbling phase changes significantly.

Assume w.l.o.g. that payoffs are bounded by 1, and let $x \in F \cap V$. To rule out
trivial cases, assume that each player has at least two actions. The vector $x$ is a
convex combination of all the entries in the payoff matrix

$$x = \sum_{a \in A} \alpha_a u(a),$$

where $(\alpha_a)_{a \in A}$ are non-negative numbers summing to 1. In fact, by Carathéodory's
Theorem, $x$ is a convex combination of three entries in the payoff matrix. Instead of
handling separately each of the alternative configurations of these three entries, we
find it simpler to handle the general case.

Fix e > 0, a natural number ko > ^, and a natural number k. Let (fca)aeA be a 
collection of positive integers such that (a) $\sum_{a \in A} k_a = k_0$, and (b) $|k_a - \alpha_a k_0| < 1$.
Define the regular path

$$\omega_0 = \sum_{a = (a_1, a_2) \in A} k_a \times (a_1, a_2).$$

Then the average payoff along $\omega_0$ is within $\varepsilon$ of $x$.
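One way to obtain integers satisfying conditions (a) and (b) is largest-remainder rounding, sketched below. This is an illustrative choice rather than the construction used in the text; the function name `round_weights` is hypothetical, the weights are passed as exact fractions summing to 1, and the sketch does not enforce positivity (an entry is 0 when $\alpha_a k_0 < 1$ and it receives no extra unit).

```python
from fractions import Fraction
from math import floor


def round_weights(alphas, k0):
    """Integers k_a with sum k0 and |k_a - alpha_a * k0| < 1 (largest-remainder rounding).

    `alphas` must be exact non-negative fractions summing to 1.
    """
    targets = [Fraction(a) * k0 for a in alphas]
    counts = [floor(t) for t in targets]
    leftover = k0 - sum(counts)            # equals the sum of the fractional parts
    # Give one extra unit to the entries with the largest fractional parts.
    # There are always at least `leftover` entries with a positive fractional part,
    # so an entry that is already exact is never rounded up.
    by_remainder = sorted(range(len(targets)),
                          key=lambda i: targets[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


alphas = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
k_a = round_weights(alphas, 10)
print(k_a)  # [5, 3, 2]
```

Each $k_a / k_0$ then differs from $\alpha_a$ by less than $1 / k_0$, which is what keeps the average payoff along $\omega_0$ close to $x$ when payoffs are bounded.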

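For each $i = 1, 2$, denote by $l_i = |A_i|$ the number of actions of player $i$, and by
$A_i = \{a_i^1, a_i^2, \ldots, a_i^{l_i}\}$ her actions. Assume w.l.o.g. that $l_1 \le l_2$, and that $a_i^1$ is the
min-max strategy of player $i$ against player $3 - i$.

The play path $\omega^*$ is defined as follows:

$$\omega^* = k^4 \times (a_1^1, a_2^1) + \Bigl(\sum_{j=1}^{k} \sum_{m=1}^{l_1} k \times (a_1^m, a_2^m)\Bigr) + (k+1) \times (a_1^1, a_2^1) + \Bigl(\sum_{j=1}^{k} \sum_{m=l_1+1}^{l_2} k \times (a_1^1, a_2^m)\Bigr) + (a_1^1, a_2^1) + \sum_{j=1}^{\infty} \omega_0.$$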

Both the punishment phase and the babbling phase are longer in this construction 
than in the construction for the Prisoner's Dilemma, yet the punishment phase is 
much longer, to ensure that the payoff that results from a deviation is close to the 
min-max value in pure strategies. 

The complexity of $\omega^*$ w.r.t. player 1 is $k^4 + k^2 l_1 + 1$. Indeed, as in Section 3.2, the
complexity of the prefix
$\omega^*(1) = k^4 \times (a_1^1, a_2^1) + \bigl(\sum_{j=1}^{k} \sum_{m=1}^{l_1} k \times (a_1^m, a_2^m)\bigr) + (k+1) \times (a_1^1, a_2^1)$
is $k^4 + k^2 l_1 + 1$, and to implement the rest of the play $\omega^*$ player 1 can
re-use states that were used to implement $\omega^*(1)$. One can verify that the complexity
of $\omega^*$ w.r.t. player 2 is $k^4 + k^2 l_1 + (k + 1) + k^2 (l_2 - l_1)$. A similar construction of
automata $M_1$ and $M_2$ for the two players, that re-uses states to implement the rest
of $\omega^*$, shows that $x$ is a BCC equilibrium payoff.